Deprecate implicit conversions
between char8_t, char16_t, and char32_t
- Document number:
- P3695R0
- Date:
- 2025-05-18
- Audience:
- EWG, SG16
- Project:
- ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Author:
- Jan Schultke <[email protected]>
- GitHub Issue:
- wg21.link/P3695R0/github
- Source:
- github.com/Eisenwave/cpp-proposals/blob/master/src/deprecate-unicode-conversion.cow
Implicit conversions between char8_t, char16_t, and char32_t
are bug-prone and thus harmful to the language.
I propose to deprecate them.
Contents
Introduction
It's not hypothetical. This really happens.
The underlying problem
Scope
What about "safe" comparisons?
What about char and wchar_t?
What about conversions with integers?
What comes after deprecation?
Impact on existing code
Replacement for deprecated behavior
Implementation experience
Wording
References
1. Introduction
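Implicit conversions between different Unicode character types invite bugs.
For example, a comparison between a char8_t and a char32_t character literal compares numeric values.
It always fails if the char8_t is a UTF-8 code unit
and the literal's code point has a value that a UTF-8 code unit cannot have.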
The assertion succeeds because Ԡ (U+0520) is UTF-8 encoded as 0xD4, 0xA0,
and NBSP is U+00A0,
so the numeric value of NBSP (0xA0) matches the second UTF-8 code unit of U+0520.
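Note that a "bad comparison" can also occur between two operands of the same type,
after an implicit conversion has already taken place,
which demonstrates that implicit conversions in general are bug-prone, not just comparisons.
We obviously don't want to deprecate comparisons between operands of the same type.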
Conversions "the other way" (from a wider to a narrower Unicode character type)
are obviously bug-prone too because information is lost,
but such bugs can already be caught by all major compilers' warnings,
and they are problematic for the same reason as any narrowing integer conversion,
not because of anything specific to character types.
The listed bugs are interesting precisely because no information is lost.
1.1. It's not hypothetical. This really happens.
These kinds of bugs are not far-fetched hypotheticals either;
I have written such bugs myself,
and have had them contributed
to my syntax highlighter [µlight],
which makes extensive use of Unicode character types.
Very early in development, I realized how dangerous these implicit conversions are,
so most functions in the style of
have a deleted overload:
,
but technically,
can have the values
and
,
so it is undetectable.
1.2. The underlying problem
The underlying problem is that
is
.
In general, it is meaningless to compare code units with different encodings.
To be fair, Unicode character types aren't strictly required to store Unicode code units.
However, that is their primary purpose, and the assumption holds true for any Unicode
2. Scope
I propose to deprecate implicit conversions between char8_t, char16_t, and char32_t.
As demonstrated above, these are extremely bug-prone.
2.1. What about "safe" comparisons?
In comparisons between code units,
certain ranges of code points yield the expected result.
For example, comparing the code units of the same ASCII character in two different encodings yields true
because all Unicode encodings are ASCII-compatible,
so any code point in the Basic Latin block (≤ U+007F)
has the same single-code-unit value in UTF-8, UTF-16, and UTF-32.
However, even those should be deprecated because:
- Keeping these valid would essentially leak implementation details of Unicode encodings into the C++ core language, which seems like unclean design.
- To rely on this "feature", the developer needs to memorize which code points are "safe to use". It is not obvious whether c == U'€' or c == U'$' are always safe (hint: the latter one is), and it's quite likely that someone uses this "feature" accidentally.
- It would make this "feature" (or lack thereof) harder to teach than it needs to be. The rule can be very simple: different charN_t types cannot be converted to one another. Simple rules are easy to teach.
2.2. What about char and wchar_t?
char and wchar_t have existed for too long to make any deprecation
of their behavior realistic at this point.
There are approximately ten trillion lines of C++ code using char [citation needed].
It would still be plausible to deprecate, say, conversions between these types and the Unicode character types.
However, there's a good chance that these are valid
because UTF-8 text is often stored in char,
and UTF-16 or UTF-32 text is often stored in wchar_t.
In contrast, two different Unicode character types almost certainly use different encodings.
2.3. What about conversions with integers?
It is quite common to compare character types to integer types.
For example, we may compare a character against an integer literal
to check whether it falls into the Basic Latin block.
There is nothing exceptionally bug-prone about comparing with, say,
an integer literal instead of a character literal,
so we are not interested in deprecating character/integer conversions.
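A character/integer comparison of this kind might look as follows. This is a minimal sketch; the name is_basic_latin is hypothetical.

```cpp
// Hypothetical check whether a code point lies in the Basic Latin block.
// Comparing a character with an integer literal is not affected by the proposal.
constexpr bool is_basic_latin(char32_t c) {
    return c <= 0x7F;
}

static_assert(is_basic_latin(U'A'));
static_assert(!is_basic_latin(U'\u00F6')); // 'ö' is outside Basic Latin
```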
2.4. What comes after deprecation?
The goal is to eventually remove these conversions entirely. Since the behavior is easily detected (§4. Implementation experience) and easily replaced (§3.1. Replacement for deprecated behavior), removal should be feasible within one or two revisions of the language.
Furthermore, I don't believe that "tombstone behavior" would be necessary,
that is, allowing the conversion to take place but making the program ill-formed if it does.
The reason is that char8_t, char16_t, and char32_t
rarely appear in overload sets that include types that are not characters.
3. Impact on existing code
It is not trivial to estimate how much code would be affected by a deprecation like this.
However, that is ultimately not what makes or breaks this proposal.
The goal is not to deprecate a rarely used feature to give it new meaning,
like the comma operator in subscript expressions prior to [P1161R3].
The goal is to deprecate a bug-prone and harmful feature to make the language safer.
The longer we wait, the more mistakes will be made using the Unicode character types.
C++ will undoubtedly get improved support for the Unicode character types over time,
making them used more frequently,
so it is better to deal with this problem now than never.
3.1. Replacement for deprecated behavior
If the new deprecation warnings spot a bug like in §1. Introduction, some work will be required to fix it, but the deprecation will have done its job.
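If the comparison is obviously safe, such as a comparison between a UTF-8 code unit
and an ASCII character literal of a different Unicode character type,
the resolution is usually trivial, like rewriting the literal or adding an explicit conversion.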
This could even be done automatically with tools like clang-tidy.
4. Implementation experience
Corentin Jabot has recently implemented a warning for such conversions.
However, the warning is more conservative than the proposed deprecation; it does not warn on "safe comparisons" (§2.1. What about "safe" comparisons?).
5. Wording
The following changes are relative to [N5008].
Change [basic.fundamental] paragraph 9 as follows:
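The types char8_t, char16_t, and char32_t are collectively called Unicode character types.
Type char8_t denotes a distinct type whose underlying type is unsigned char.
Types char16_t and char32_t denote distinct types
whose underlying types are std::uint_least16_t and std::uint_least32_t,
respectively, in <cstdint>.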
Change [conv.integral] paragraph 1 as follows:
A prvalue of an integer type S
can be converted to a prvalue of another integer type D.
The conversion is deprecated ([depr.conv.unicode]) if
- S and D are two different Unicode character types ([basic.fundamental]) and
- the conversion is not necessitated by a static_cast ([expr.static.cast]).
A prvalue of an unscoped enumeration type can be converted to a prvalue of an integer type.
Insert a new paragraph immediately following [conv.integral] paragraph 1:
A prvalue of an unscoped enumeration type can be converted to a prvalue of an integer type.
Change [expr.arith.conv] paragraph 1 as follows:
Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:
- The lvalue-to-rvalue conversion ([conv.lval]) is applied to each operand and the resulting prvalues are used in place of the original operands for the remainder of this section.
- [...]
- Otherwise, each operand is converted to a common type C. The conversion is deprecated if the operands are of two different Unicode character types ([depr.conv.unicode]). The integral promotion rules ([conv.prom]) are used to determine a type T1 and type T2 for each operand. Then the following rules are applied to determine C:
  - [...]
Insert a new subclause in [depr] between [depr.local] and [depr.capture.this], containing a single paragraph:
Unicode character conversions [depr.conv.unicode]
The following conversions are deprecated:
- Integral conversions ([conv.integral]) not necessitated by a static_cast, where the source type and destination type are two different Unicode character types ([basic.fundamental]).
- Usual arithmetic conversions ([expr.arith.conv]) where the operands after lvalue-to-rvalue conversion ([conv.lval]) are two different Unicode character types.